feat(model): add quantization support for LLM2Vec text encoder#12
Open
Lee-Jun-Hyuk-37 wants to merge 1 commit into nv-tlabs:main
Conversation
Add a KIMODO_QUANTIZE env var to load the Llama-3-8B text encoder with reduced precision via bitsandbytes:

KIMODO_QUANTIZE=4bit - NF4 4-bit (~5GB VRAM, down from ~17GB)
KIMODO_QUANTIZE=8bit - INT8 8-bit (~9GB VRAM)

This makes Kimodo usable on consumer GPUs (8-12GB) while retaining full text-prompt support. The quantized model is pinned to its device to avoid errors from .to() calls on quantized weights.

Requires: pip install bitsandbytes accelerate
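The description mentions pinning the quantized model to its device so that later .to() calls don't raise. A minimal sketch of such a guard, assuming a simple wrapper class (the class and attribute names are illustrative, not taken from the actual diff):

```python
class PinnedEncoder:
    """Wraps a quantized text encoder and ignores later .to() calls,
    since bitsandbytes-quantized weights cannot be moved across devices.
    Illustrative sketch only; not the names used in the PR."""

    def __init__(self, model, device):
        self.model = model
        self.device = device  # the device the quantized weights live on

    def to(self, *args, **kwargs):
        # No-op: the quantized weights stay pinned. Returning self keeps
        # chained calls like encoder.to(device).eval() working.
        return self

    def __getattr__(self, name):
        # Delegate everything else (encode, eval, ...) to the wrapped model.
        return getattr(self.model, name)
```

Callers can then treat the wrapped encoder like any other module; device moves simply become no-ops instead of errors.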
Thank you for this excellent project.
Summary
Adds a KIMODO_QUANTIZE env var to load the Llama-3-8B text encoder with reduced precision via bitsandbytes:

4bit (NF4, ~5GB VRAM)
8bit (INT8, ~9GB VRAM)

The quantized model is pinned to its device to avoid errors from .to() calls on quantized weights.

Motivation
Kimodo currently requires ~17GB VRAM, which limits it to high-end GPUs (A100, RTX 3090/4090). Many consumer GPUs have 8-12GB VRAM, which is enough for the diffusion model (~1GB) but not for the full-precision text encoder (~16GB).
This change lets users trade a small amount of text embedding quality for significantly lower VRAM usage, making Kimodo accessible on a much wider range of hardware.
Usage
Requires:
pip install bitsandbytes accelerate
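The PR text doesn't include the loading code itself, so here is a minimal sketch of how the env-var-to-quantization mapping could look. The function name is illustrative, and the keyword arguments are assumptions based on the standard bitsandbytes integration in transformers (load_in_4bit / load_in_8bit / bnb_4bit_quant_type), not the actual diff:

```python
import os

def quantization_kwargs(env_value):
    """Map the KIMODO_QUANTIZE env var to from_pretrained-style keyword
    arguments for the text encoder (sketch; the real diff may differ)."""
    if env_value == "4bit":
        # NF4 4-bit via bitsandbytes: ~5GB VRAM for Llama-3-8B
        return {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"}
    if env_value == "8bit":
        # INT8 via bitsandbytes: ~9GB VRAM
        return {"load_in_8bit": True}
    # Unset or unrecognized: full precision (~17GB VRAM)
    return {}

kwargs = quantization_kwargs(os.environ.get("KIMODO_QUANTIZE"))
```

A user would then run their usual Kimodo entry point with, e.g., KIMODO_QUANTIZE=4bit set in the environment.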